Abstract

This paper presents stride scheduling, a deterministic schedul- 
ing technique that efficiently supports the same flexible 
resource management abstractions introduced by lottery 
scheduling. Compared to lottery scheduling, stride schedul- 
ing achieves significantly improved accuracy over relative 
throughput rates, with significantly lower response time vari- 
ability. Stride scheduling implements proportional-share con- 
trol over processor time and other resources by cross-applying 
elements of rate-based flow control algorithms designed for 
networks. We introduce new techniques to support dynamic 
changes and higher-level resource management abstractions. 
We also introduce a novel hierarchical stride scheduling al- 
gorithm that achieves better throughput accuracy and lower 
response time variability than prior schemes. Stride schedul- 
ing is evaluated using both simulations and prototypes imple- 
mented for the Linux kernel. 

Keywords: dynamic scheduling, proportional-share resource 
allocation, rate-based service, service rate objectives 

1 Introduction 

Schedulers for multithreaded systems must multiplex 
scarce resources in order to service requests of varying 
importance. Accurate control over relative computation 
rates is required to achieve service rate objectives for
users and applications. Such control is desirable across a 
broad spectrum of systems, including databases, media- 
based applications, and networks. Motivating examples 
include control over frame rates for competing video 
viewers, query rates for concurrent clients by databases 
and Web servers, and the consumption of shared re- 
sources by long-running computations. 

Few general-purpose approaches have been proposed 
to support flexible, responsive control over service rates. 
We recently introduced lottery scheduling, a random- 
ized resource allocation mechanism that provides effi- 
cient, responsive control over relative computation rates 
[Wal94]. Lottery scheduling implements proportional- 
share resource management - the resource consumption 
rates of active clients are proportional to the relative 
shares that they are allocated. Higher-level abstractions 
for flexible, modular resource management were also 
introduced with lottery scheduling, but they do not de- 
pend on the randomized implementation of proportional 
sharing. 

In this paper we introduce stride scheduling, a deter- 
ministic scheduling technique that efficiently supports 
the same flexible resource management abstractions in- 
troduced by lottery scheduling. One contribution of our 
work is a cross-application and generalization of rate- 
based flow control algorithms designed for networks 
[Dem90, Zha91, ZhK91, Par93] to schedule other re- 
sources such as processor time. We present new tech- 
niques to support dynamic operations such as the modifi- 
cation of relative allocations and the transfer of resource 
rights between clients. We also introduce a novel hierarchical
stride scheduling algorithm. Hierarchical stride scheduling is
a recursive application of the basic technique that achieves
better throughput accuracy and lower response time
variability than previous schemes.

Simulation results demonstrate that, compared to lot- 
tery scheduling, stride scheduling achieves significantly 
improved accuracy over relative throughput rates, with 
significantly lower response time variability. In con- 
trast to other deterministic schemes, stride scheduling 
efficiently supports operations that dynamically modify 
relative allocations and the number of clients competing 
for a resource. We have also implemented prototype 
stride schedulers for the Linux kernel, and found that 
they provide accurate control over both processor time 
and the relative network transmission rates of competing 
sockets. 

In the next section, we present the core stride- 
scheduling mechanism. Section 3 describes extensions 
that support the resource management abstractions in- 
troduced with lottery scheduling. Section 4 introduces 
hierarchical stride scheduling. Simulation results with 
quantitative comparisons to lottery scheduling appear in 
Section 5. A discussion of our Linux prototypes and related
implementation issues is presented in Section 6.
In Section 7, we examine related work. Finally, we 
summarize our conclusions in Section 8. 

2 Stride Scheduling 

Stride scheduling is a deterministic allocation mecha- 
nism for time-shared resources. Resources are allocated 
in discrete time slices; we refer to the duration of a 
standard time slice as a quantum. Resource rights are 
represented by tickets - abstract, first-class objects that 
can be issued in different amounts and passed between 
clients. 1 Throughput rates for active clients are directly 
proportional to their ticket allocations. Thus, a client 
with twice as many tickets as another will receive twice 
as much of a resource in a given time interval. Client 
response times are inversely proportional to ticket allo- 
cations. Therefore a client with twice as many tickets 
as another will wait only half as long before acquiring a 
resource. 

The throughput accuracy of a proportional-share
scheduler can be characterized by measuring the difference
between the specified and actual number of allocations
that a client receives during a series of allocations.
If a client has $t$ tickets in a system with a total of $T$
tickets, then its specified allocation after $n_a$ consecutive
allocations is $n_a \times t/T$. Due to quantization, it is
typically impossible to achieve this ideal exactly. We
define a client's absolute error as the absolute value of
the difference between its specified and actual number
of allocations. We define the pairwise relative error
between clients $c_i$ and $c_j$ as the absolute error for the
subsystem containing only $c_i$ and $c_j$, where $T = t_i + t_j$,
and $n_a$ is the total number of allocations received by both
clients.
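The error definitions above can be written down directly. The following is a minimal sketch; the function names are illustrative assumptions and do not appear in the paper.

```c
#include <assert.h>

/* specified (ideal) number of allocations for a client holding t of T
   tickets after n_a consecutive allocations: n_a * t / T */
double specified_allocation(int t, int T, int n_a)
{
    assert(T > 0 && t >= 0 && t <= T);
    return (double)n_a * t / T;
}

/* absolute error: |specified - actual| */
double absolute_error(int t, int T, int n_a, int actual)
{
    double d = specified_allocation(t, T, n_a) - actual;
    return d < 0 ? -d : d;
}

/* pairwise relative error between clients i and j: the absolute error of
   client i in the subsystem containing only i and j, so that
   T = t_i + t_j and n_a = actual_i + actual_j */
double pairwise_relative_error(int t_i, int t_j, int actual_i, int actual_j)
{
    return absolute_error(t_i, t_i + t_j, actual_i + actual_j, actual_i);
}
```

For instance, a client holding 3 of 6 tickets that has received 48 of 100 allocations has a specified allocation of 50 and an absolute error of 2.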

While lottery scheduling offers probabilistic guarantees
about throughput and response time, stride scheduling
provides stronger deterministic guarantees. For lottery
scheduling, after a series of $n_a$ allocations, a client's
expected relative error and expected absolute error are
both $O(\sqrt{n_a})$. For stride scheduling, the relative error
for any pair of clients is never greater than one, independent
of $n_a$. However, for skewed ticket distributions
it is still possible for a client to have $O(n_c)$ absolute
error, where $n_c$ is the number of clients. Nevertheless,
stride scheduling is considerably more accurate than lottery
scheduling, since its error does not grow with the
number of allocations. In Section 4, we introduce a
hierarchical variant of stride scheduling that provides a
tighter $O(\lg n_c)$ bound on each client's absolute error.

This section first presents the basic stride-scheduling 
algorithm, and then introduces extensions that support 
dynamic client participation, dynamic modifications to 
ticket allocations, and nonuniform quanta. 

2.1 Basic Algorithm 

The core stride scheduling idea is to compute a repre- 
sentation of the time interval, or stride, that a client must 
wait between successive allocations. The client with the 
smallest stride will be scheduled most frequently. A 
client with half the stride of another will execute twice 
as quickly; a client with double the stride of another 
will execute twice as slowly. Strides are represented in 
virtual time units called passes, instead of units of real 
time such as seconds. 

Three state variables are associated with each client: 
tickets, stride, and pass. The tickets field specifies 
the client's resource allocation, relative to other clients. 



/* per-client state */
typedef struct {
    int tickets, stride, pass;
} *client_t;

/* large integer stride constant (e.g. 1M) */
const int stride1 = (1 << 20);

/* current resource owner */
client_t current;

/* initialize client with specified allocation */
void client_init(client_t c, queue_t q, int tickets)
{
    /* stride is inverse of tickets */
    c->tickets = tickets;
    c->stride = stride1 / tickets;
    c->pass = c->stride;

    /* join competition for resource */
    queue_insert(q, c);
}

/* proportional-share resource allocation */
void allocate(queue_t q)
{
    /* select client with minimum pass value */
    current = queue_remove_min(q);

    /* use resource for quantum */
    use_resource(current);

    /* compute next pass using stride */
    current->pass += current->stride;
    queue_insert(q, current);
}


Figure 1: Basic Stride Scheduling Algorithm. ANSI
C code for scheduling a static set of clients. Queue
manipulations can be performed in $O(\lg n_c)$ time by using an
appropriate data structure.



The stride field is inversely proportional to tickets, and 
represents the interval between selections, measured in 
passes. The pass field represents the virtual time index 
for the client's next selection. 

Performing a resource allocation is very simple: the 
client with the minimum pass is selected, and its pass 
is advanced by its stride. If more than one client has 
the same minimum pass value, then any of them may be 
selected. A reasonable deterministic approach is to use 
a consistent ordering to break ties, such as one defined 
by unique client identifiers. 

Figure 1 lists ANSI C code for the basic stride
scheduling algorithm. For simplicity, we assume a static
set of clients with fixed ticket assignments. The stride
scheduling state for each client must be initialized via
client_init() before any allocations are performed by
allocate(). These restrictions will be relaxed in subsequent
sections to permit more dynamic behavior.

To accurately represent stride as the reciprocal of 
tickets, a floating-point representation could be used. 
We present a more efficient alternative that uses a high- 
precision fixed-point integer representation. This is eas- 
ily implemented by multiplying the inverted ticket value 
by a large integer constant. We will refer to this constant
as stride1, since it represents the stride corresponding to
the minimum ticket allocation of one. 2

The cost of performing an allocation depends
on the data structure used to implement the client
queue. A priority queue can be used to implement
queue_remove_min() and other queue operations
in $O(\lg n_c)$ time or better, where $n_c$ is the number of
clients [Cor90]. A skip list could also provide expected
$O(\lg n_c)$ time queue operations with low constant overhead
[Pug90]. For small $n_c$ or heavily skewed ticket
distributions, a simple sorted list is likely to be most
efficient in practice.
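For the small-$n_c$ case mentioned above, a sorted singly linked list is enough. The following is a minimal sketch rather than the paper's implementation: the `id` and `next` fields and the names `list_insert` and `list_remove_min` are assumptions introduced for illustration, and ties on pass are broken by client identifier.

```c
#include <assert.h>
#include <stddef.h>

struct client {
    int id;                 /* unique identifier, used to break ties */
    int tickets, stride, pass;
    struct client *next;
};

/* returns nonzero if a should be scheduled before b */
static int before(const struct client *a, const struct client *b)
{
    if (a->pass != b->pass)
        return a->pass < b->pass;
    return a->id < b->id;
}

/* insert c into the list sorted by (pass, id); returns the new head */
struct client *list_insert(struct client *head, struct client *c)
{
    struct client **p = &head;

    assert(c != NULL);
    while (*p != NULL && before(*p, c))
        p = &(*p)->next;
    c->next = *p;
    *p = c;
    return head;
}

/* remove and return the minimum-pass client, updating *head */
struct client *list_remove_min(struct client **head)
{
    struct client *c = *head;

    if (c != NULL)
        *head = c->next;
    return c;
}
```

Insertion is linear in the list length, but removal of the minimum is constant time, which is why this layout can be competitive when there are few clients.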

Figure 2 illustrates an example of stride scheduling.
Three clients, A, B, and C, are competing for a time-shared
resource with a 3 : 2 : 1 ticket ratio. For simplicity,
a convenient stride1 = 6 is used instead of a large number,
yielding respective strides of 2, 3, and 6. The pass value
of each client is plotted as a function of time. For each
quantum, the client with the minimum pass value is
selected, and its pass is advanced by its stride. Ties are
broken using the arbitrary but consistent client ordering
A, B, C.

Figure 2: Stride Scheduling Example. Clients A (triangles),
B (circles), and C (squares) have a 3 : 2 : 1 ticket ratio.
In this example, stride1 = 6, yielding respective strides of 2,
3, and 6. For each quantum, the client with the minimum pass
value is selected, and its pass is advanced by its stride.
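The 3 : 2 : 1 example can be reproduced with a short simulation. This is a sketch under stated assumptions: a linear scan stands in for the priority queue, clients are identified by array index (0 = A, 1 = B, 2 = C), and the `simulate` function and its parameters are illustrative names that do not appear in the paper.

```c
#include <assert.h>

#define NCLIENTS 3

/* per-client state, as in the basic algorithm */
struct client { int tickets, stride, pass; };

/* small stride constant chosen to match the example */
static const int stride1 = 6;

/* Run n quanta. The index of the client chosen in each quantum is
   written to order[], and per-client totals are written to counts[]. */
void simulate(const int tickets[NCLIENTS], int n,
              int order[], int counts[NCLIENTS])
{
    struct client c[NCLIENTS];
    int i, q;

    for (i = 0; i < NCLIENTS; i++) {
        assert(tickets[i] > 0);
        c[i].tickets = tickets[i];
        c[i].stride = stride1 / tickets[i];
        c[i].pass = c[i].stride;
        counts[i] = 0;
    }
    for (q = 0; q < n; q++) {
        int min = 0;

        /* strict '<' keeps the lowest index on ties: A, then B, then C */
        for (i = 1; i < NCLIENTS; i++)
            if (c[i].pass < c[min].pass)
                min = i;
        order[q] = min;
        counts[min]++;
        c[min].pass += c[min].stride;
    }
}
```

With tickets {3, 2, 1}, the first six quanta go to A, B, A, A, B, C, so the clients receive 3, 2, and 1 quanta respectively, after which all three pass values are equal again and the pattern repeats.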

2.2 Dynamic Client Participation 

The algorithm presented in Figure 1 does not support 
dynamic changes in the number of clients competing for 
a resource. When clients are allowed to join and leave 
at any time, their state must be appropriately modified. 
Figure 3 extends the basic algorithm to efficiently handle 
dynamic changes. 

A key extension is the addition of global variables
that maintain aggregate information about the set of active
clients. The global_tickets variable contains the
total ticket sum for all active clients. The global_pass
variable maintains the "current" pass for the scheduler.
The global_pass advances at the rate of global_stride per
quantum, where global_stride = stride1 / global_tickets.
Conceptually, the global_pass continuously advances at
a smooth rate. This is implemented by invoking the
global_pass_update() routine whenever the global_pass
value is needed. 3



A state variable is also associated with each client
to store the remaining portion of its stride when a dynamic
change occurs. The remain field represents the
number of passes that are left before a client's next
selection. When a client leaves the system, remain is
computed as the difference between the client's pass
and the global_pass. When a client rejoins the system,
its pass value is recomputed by adding its remain value
to the global_pass.

This mechanism handles situations involving either 
positive or negative error between the specified and ac- 
tual number of allocations. If remain < stride, then 
the client is effectively given credit when it rejoins for 
having previously waited for part of its stride without 
receiving a quantum. If remain > stride, then the client 
is effectively penalized when it rejoins for having previ- 
ously received a quantum without waiting for its entire 
stride. 4 

This approach makes an implicit assumption that a
partial quantum now is equivalent to a partial quantum
later. In general, this is a reasonable assumption, and
resembles the treatment of nonuniform quanta that will
be presented in Section 2.4. However, it may not be
appropriate if the total number of tickets competing for
a resource varies significantly between the time that a
client leaves and rejoins the system.

The time complexity for both the client_leave() and
client_join() operations is $O(\lg n_c)$, where $n_c$ is the
number of clients. These operations are efficient because the
stride scheduling state associated with distinct clients is
completely independent; a change to one client does not
require updates to any other clients. The $O(\lg n_c)$ cost
results from the need to perform queue manipulations.

2.3 Dynamic Ticket Modifications 

Additional support is needed to dynamically modify 
client ticket allocations. Figure 4 illustrates a dynamic 
allocation change, and Figure 5 lists ANSI C code for 



3 Due to the use of a fixed-point integer representation for
strides, small quantization errors may accumulate slowly, causing
global_pass to drift away from client pass values over a long period
of time. This is unlikely to be a practical problem, since client pass
values are recomputed using global_pass each time they leave and
rejoin the system. However, this problem can be avoided by very
infrequently resetting global_pass to the minimum pass value for the
set of active clients.

4 Several interesting alternatives could also be implemented. For 
example, a client could be given credit for some or all of the passes 
that elapse while it is inactive. 



/* per-client state */
typedef struct {
    int tickets, stride, pass, remain;
} *client_t;

/* quantum in real time units (e.g. 1M cycles) */
const int quantum = (1 << 20);

/* large integer stride constant (e.g. 1M) */
const int stride1 = (1 << 20);

/* current resource owner */
client_t current;

/* global aggregate tickets, stride, pass */
int global_tickets, global_stride, global_pass;

/* update global pass based on elapsed real time */
void global_pass_update(void)
{
    static int last_update = 0;
    int elapsed;

    /* compute elapsed time, advance last_update */
    elapsed = time() - last_update;
    last_update += elapsed;

    /* advance global pass by quantum-adjusted stride */
    global_pass +=
        (global_stride * elapsed) / quantum;
}

/* update global tickets and stride to reflect change */
void global_tickets_update(int delta)
{
    global_tickets += delta;
    global_stride = stride1 / global_tickets;
}

/* join competition for resource */
void client_join(client_t c, queue_t q)
{
    /* compute pass for next allocation */
    global_pass_update();
    c->pass = global_pass + c->remain;

    /* add to queue */
    global_tickets_update(c->tickets);
    queue_insert(q, c);
}

/* leave competition for resource */
void client_leave(client_t c, queue_t q)
{
    /* compute remainder of current stride */
    global_pass_update();
    c->remain = c->pass - global_pass;

    /* remove from queue */
    global_tickets_update(-c->tickets);
    queue_remove(q, c);
}

/* proportional-share resource allocation */
void allocate(queue_t q)
{
    int elapsed;

    /* select client with minimum pass value */
    current = queue_remove_min(q);

    /* use resource, measuring elapsed real time */
    elapsed = use_resource(current);

    /* compute next pass using quantum-adjusted stride */
    current->pass +=
        (current->stride * elapsed) / quantum;
    queue_insert(q, current);
}

/* initialize client with specified allocation */
void client_init(client_t c, int tickets)
{
    /* stride is inverse of tickets, whole stride remains */
    c->tickets = tickets;
    c->stride = stride1 / tickets;
    c->remain = c->stride;
}



Figure 3: Dynamic Stride Scheduling Algorithm. ANSI C code for stride scheduling operations, including support for
joining, leaving, and nonuniform quanta. Queue manipulations can be performed in $O(\lg n_c)$ time by using an appropriate data
structure.
Figure 4: Allocation Change. Modifying a client's
allocation from tickets to tickets' requires only a constant-time
recomputation of its stride and pass. The new stride' is
inversely proportional to tickets'. The new pass' is determined
by scaling remain, the remaining portion of the current
stride, by stride' / stride.



dynamically changing a client's ticket allocation. When
a client's allocation is dynamically changed from tickets
to tickets', its stride and pass values must be recomputed.
The new stride' is computed as usual, inversely
proportional to tickets'. To compute the new pass', the
remaining portion of the client's current stride, denoted
by remain, is adjusted to reflect the new stride'. This
is accomplished by scaling remain by stride' / stride.
In Figure 4, the client's ticket allocation is increased,
so pass is decreased, compressing the time remaining
until the client is next selected. If its allocation had
decreased, then pass would have increased, expanding the
time remaining until the client is next selected.
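The scaling step can be sketched in isolation. This is an illustrative fragment, not the paper's code: the `rescale` name is hypothetical, the client is assumed to be off the queue, and the product is formed in 64-bit arithmetic because remain and stride can each approach $2^{20}$, which is an addition of this sketch.

```c
#include <assert.h>

static const int stride1 = 1 << 20;

struct client { int tickets, stride, remain; };

/* Recompute stride and remain for a new ticket count by scaling
   remain by stride' / stride. */
void rescale(struct client *c, int new_tickets)
{
    int new_stride;

    assert(new_tickets > 0 && c->stride > 0);
    new_stride = stride1 / new_tickets;
    c->remain = (int)(((long long)c->remain * new_stride) / c->stride);
    c->stride = new_stride;
    c->tickets = new_tickets;
}
```

For example, a client with one ticket that is halfway through its stride (remain = 524288) and is raised to two tickets ends up with stride 524288 and remain 262144, so it is still halfway through its new, shorter stride.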

The client_modify() operation requires $O(\lg n_c)$ time,
where $n_c$ is the number of clients. As with dynamic
changes to the number of clients, ticket allocation
changes are efficient because the stride scheduling state
associated with distinct clients is completely independent;
the dominant cost is due to queue manipulations.



/* dynamically modify client ticket allocation */
void client_modify(client_t c, queue_t q, int tickets)
{
    int remain, stride;

    /* leave queue for resource */
    client_leave(c, q);

    /* compute new stride */
    stride = stride1 / tickets;

    /* scale remaining passes to reflect change in stride */
    remain = (c->remain * stride) / c->stride;

    /* update client state */
    c->tickets = tickets;
    c->stride = stride;
    c->remain = remain;

    /* rejoin queue for resource */
    client_join(c, q);
}


Figure 5: Dynamic Ticket Modification. ANSI C code
for dynamic modifications to client ticket allocations. Queue
manipulations can be performed in $O(\lg n_c)$ time by using
an appropriate data structure.



2.4 Nonuniform Quanta 

With the basic stride scheduling algorithm presented in
Figure 1, a client that does not consume its entire allocated
quantum would receive less than its entitled share
of a resource. Similarly, it may be possible for a client's
usage to exceed a standard quantum in some situations.
For example, under a non-preemptive scheduler, client
run lengths can vary considerably.

Fortunately, fractional and variable-size quanta can
easily be accommodated. When a client consumes a
fraction $f$ of its allocated time quantum, its pass should
be advanced by $f \times$ stride instead of stride. If $f < 1$,
then the client's pass will be increased less, and it will
be scheduled sooner. If $f > 1$, then the client's pass
will be increased more, and it will be scheduled later.
The extended code listed in Figure 3 supports nonuniform
quanta by effectively computing $f$ as the elapsed
resource usage time divided by a standard quantum in
the same time units.
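The fractional advance can be sketched as a single integer computation. This is an illustrative helper (the name `pass_advance` is an assumption); it multiplies before dividing so that the fraction is not truncated to zero, and it widens to 64 bits, which is an addition of this sketch rather than something shown in the paper.

```c
#include <assert.h>

static const int quantum = 1 << 20;   /* standard quantum, in time units */

/* pass advance for a client that used `elapsed` time units:
   f * stride, with f = elapsed / quantum */
int pass_advance(int stride, int elapsed)
{
    assert(stride >= 0 && elapsed >= 0);
    return (int)(((long long)stride * elapsed) / quantum);
}
```

A client with stride 3000 that uses half a quantum advances by 1500, one that uses a full quantum advances by 3000, and one that overruns to two quanta advances by 6000.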

Another extension would permit clients to specify
the quantum size that they require. 5 This could be
implemented by associating an additional quantum_c field
with each client, and scaling each client's stride field by
quantum_c / quantum. Deviations from a client's specified
quantum would still be handled as described above,
with $f$ redefined as the elapsed resource usage divided
by the client-specific quantum_c.

5 An alternative would be to allow a client to specify its scheduling
period. Since a client's period and quantum are related by its relative
resource share, specifying one quantity yields the other.

3 Flexible Resource Management 

Since stride scheduling enables low-overhead dynamic 
modifications, it can efficiently support the flexible re- 
source management abstractions introduced with lottery 
scheduling [Wal94]. In this section, we explain how 
ticket transfers, ticket inflation, and ticket currencies 
can be implemented on top of a stride-based substrate
for proportional sharing. 

3.1 Ticket Transfers 

A ticket transfer is an explicit transfer of tickets from 
one client to another. Ticket transfers are particularly 
useful when one client blocks waiting for another. For 
example, during a synchronous RPC, a client can loan its 
resource rights to the server computing on its behalf. A 
transfer of t tickets between clients A and B essentially 
consists of two dynamic ticket modifications. Using 
the code presented in Figure 5, these modifications are
implemented by invoking client_modify(A, q, A.tickets
- t) and client_modify(B, q, B.tickets + t). When A
transfers tickets to B, A's stride and pass will increase, 
while B's stride and pass will decrease. 

A slight complication arises in the case of a complete
ticket transfer; i.e., when A transfers its entire ticket
allocation to B. In this case, A's adjusted ticket value is
zero, leading to an adjusted stride of infinity (division
by zero). To circumvent this problem, we record the
fraction of A's stride that is remaining at the time of the
transfer, and then adjust that remaining fraction when
A once again obtains tickets. This can easily be
implemented by computing A's remain value at the time of the
transfer, and deferring the computation of its stride and
pass values until A receives a non-zero ticket allocation
(perhaps via a return transfer from B).
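One way to realize the deferred computation is to leave stride and remain untouched while a client holds zero tickets. The following is a minimal sketch under stated assumptions: both clients are treated as already off the queue, the `set_tickets` and `transfer` names are hypothetical, and the 64-bit intermediate product is an addition of this sketch.

```c
#include <assert.h>

static const int stride1 = 1 << 20;

struct client {
    int tickets;   /* current allocation; 0 while fully transferred away */
    int stride;    /* stride for the last non-zero allocation */
    int remain;    /* passes left in the current stride */
};

/* change the allocation of a client that is currently off the queue */
void set_tickets(struct client *c, int tickets)
{
    assert(tickets >= 0 && c->stride > 0);
    if (tickets > 0) {
        int stride = stride1 / tickets;

        c->remain = (int)(((long long)c->remain * stride) / c->stride);
        c->stride = stride;
    }
    /* tickets == 0: keep stride and remain as they are, deferring the
       rescaling until a non-zero allocation arrives */
    c->tickets = tickets;
}

/* transfer t tickets from a to b */
void transfer(struct client *a, struct client *b, int t)
{
    assert(t >= 0 && t <= a->tickets);
    set_tickets(a, a->tickets - t);
    set_tickets(b, b->tickets + t);
}
```

Because the old stride is retained alongside remain, the ratio remain / stride (the fraction of the stride still outstanding) survives the period with zero tickets and is rescaled correctly when tickets return.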

3.2 Ticket Inflation 

An alternative to explicit ticket transfers is ticket infla- 
tion, in which a client can escalate its resource rights 
by creating more tickets. Ticket inflation (or deflation)
simply consists of a dynamic ticket modification for a
client. Ticket inflation causes a client's stride and pass to
decrease; deflation causes its stride and pass to increase.
Ticket inflation is useful among mutually trusting 
clients, since it permits resource rights to be reallocated 
without explicitly reshuffling tickets among clients. 
However, ticket inflation is also dangerous, since any 
client can monopolize a resource simply by creating a 
large number of tickets. In order to avoid the dangers 
of inflation while still exploiting its advantages, we in- 
troduced a currency abstraction for lottery scheduling 
[Wal94] that is loosely borrowed from economics. 

3.3 Ticket Currencies 

A ticket currency defines a resource management ab- 
straction barrier that contains the effects of ticket in- 
flation in a modular way. Tickets are denominated in 
currencies, allowing resource rights to be expressed in 
units that are local to each group of mutually trusting 
clients. Each currency is backed, or funded, by tickets
that are denominated in more primitive currencies.
Currency relationships may form an arbitrary acyclic 
graph, such as a hierarchy of currencies. The effects of 
inflation are locally contained by effectively maintain- 
ing an exchange rate between each local currency and a 
common base currency that is conserved. The currency 
abstraction is useful for flexibly naming, sharing, and 
protecting resource rights. 

The currency abstraction introduced for lottery 
scheduling can also be used with stride scheduling. One 
implementation technique is to always immediately con- 
vert ticket values denominated in arbitrary currencies 
into units of the common base currency. Any changes 
to the value of a currency would then require dynamic 
modifications to all clients holding tickets denominated 
in that currency, or one derived from it. 6 Thus, the 
scope of any changes in currency values is limited to 
exactly those clients which are affected. Since curren- 
cies are used to group and isolate logical sets of clients, 
the impact of currency fluctuations will typically be very 
localized. 
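Conversion to base units can be sketched for the simple case in which each currency is funded from a single parent currency, which is a restriction of this sketch: the paper allows an arbitrary acyclic graph. The `struct currency` layout and the `base_value` name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical currency record: `funding` tickets denominated in `parent`
   back this currency, and `issued` tickets have been issued in it.
   parent == NULL marks the base currency. */
struct currency {
    struct currency *parent;
    int funding;
    int issued;
};

/* value in base units of `amount` tickets denominated in cur */
double base_value(const struct currency *cur, int amount)
{
    if (cur->parent == NULL)
        return amount;
    assert(cur->issued > 0);
    return base_value(cur->parent, cur->funding) * amount / cur->issued;
}
```

If a currency funded by 1000 base tickets has issued 200 tickets, then 100 of its tickets are worth 500 base units. Inflating the currency to 400 issued tickets halves that to 250, and only clients holding tickets in that currency, or in one derived from it, need their stride state modified.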



6 An important exception is that changes to the number of
tickets in the base currency do not require any modifications. This is
because all stride scheduling state is computed from ticket values
expressed in base units, and the state associated with distinct clients
is independent.



4 Hierarchical Stride Scheduling 



Stride scheduling guarantees that the relative throughput
error for any pair of clients never exceeds a single quantum.
However, depending on the distribution of tickets
to clients, a large $O(n_c)$ absolute throughput error is
still possible, where $n_c$ is the number of clients.

For example, consider a set of 101 clients with a 
100 : 1 : . . . : 1 ticket allocation. A schedule that mini- 
mizes absolute error and response time variability would 
alternate the 100-ticket client with each of the single- 
ticket clients. However, the standard stride algorithm 
schedules the clients in order, with the 100-ticket client 
receiving 100 quanta before any other client receives 
a single quantum. Thus, after 100 allocations, the in- 
tended allocation for the 100-ticket client is 50, while 
its actual allocation is 100, yielding a large absolute 
error of 50. This behavior is also exhibited by similar
rate-based flow control algorithms for networks
[Dem90, Zha91, ZhK91, Par93].
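The 101-client example can be checked with a short simulation of the basic algorithm. This is a sketch under stated assumptions: a linear scan replaces the queue, ties go to the lowest client index, the large client is index 0, and the `big_client_quanta` name is illustrative.

```c
#include <assert.h>

#define NCLIENTS 101

static const int stride1 = 1 << 20;

/* Simulate n quanta of basic stride scheduling for one client holding
   `big` tickets and NCLIENTS - 1 clients holding one ticket each.
   Returns the number of quanta received by the large client. */
int big_client_quanta(int big, int n)
{
    static int stride[NCLIENTS], pass[NCLIENTS];
    int i, q, count = 0;

    assert(big > 0);
    for (i = 0; i < NCLIENTS; i++) {
        stride[i] = stride1 / (i == 0 ? big : 1);
        pass[i] = stride[i];
    }
    for (q = 0; q < n; q++) {
        int min = 0;

        for (i = 1; i < NCLIENTS; i++)   /* ties go to the lowest index */
            if (pass[i] < pass[min])
                min = i;
        if (min == 0)
            count++;
        pass[min] += stride[min];
    }
    return count;
}
```

With big = 100, the large client receives all of the first 100 quanta, against a specified allocation of 100 x 100/200 = 50. After 200 quanta it still has 100, which now matches its specified allocation exactly, so the error is large within a cycle but does not accumulate across cycles.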

In this section we describe a novel hierarchical variant
of stride scheduling that limits the absolute throughput
error of any client to $O(\lg n_c)$ quanta. For the 101-client
example described above, hierarchical stride scheduler
simulations produced a maximum absolute error of only
4.5. Our algorithm also significantly reduces response
time variability by aggregating clients to improve
interleaving. Since it is common for systems to consist
of a small number of high-throughput clients together
with a large number of low-throughput clients, hierarchical
stride scheduling represents a practical improvement
over previous work.

4.1 Basic Algorithm 

Hierarchical stride scheduling is essentially a recur- 
sive application of the basic stride scheduling algo- 
rithm. Individual clients are combined into groups with 
larger aggregate ticket allocations, and correspondingly 
smaller strides. An allocation is performed by invok- 
ing the normal stride scheduling algorithm first among 
groups, and then among individual clients within groups. 

Although many different groupings are possible, we 
consider a balanced binary tree of groups. Each leaf 
node represents an individual client. Each internal node 
represents the group of clients (leaf nodes) that it covers, 
and contains their aggregate tickets, stride, and pass 



/* binary tree node */
typedef struct node {
    struct node *left, *right, *parent;
    int tickets, stride, pass;
} *node_t;

/* quantum in real time units (e.g. 1M cycles) */
const int quantum = (1 << 20);

/* large integer stride constant (e.g. 1M) */
const int stride1 = (1 << 20);

/* current resource owner (a leaf node) */
node_t current;

/* proportional-share resource allocation */
void allocate(node_t root)
{
    int elapsed;
    node_t n;

    /* traverse root-to-leaf path following min pass */
    for (n = root; !node_is_leaf(n); )
        if (n->left == NULL ||
            n->right->pass < n->left->pass)
            n = n->right;
        else
            n = n->left;

    /* use resource, measuring elapsed real time */
    current = n;
    elapsed = use_resource(current);

    /* update pass for each ancestor using its stride */
    for (n = current; n != NULL; n = n->parent)
        n->pass += (n->stride * elapsed) / quantum;
}


Figure 6: Hierarchical Stride Scheduling Algorithm.
ANSI C code for hierarchical stride scheduling with a static set
of clients. The main data structure is a binary tree of nodes.
Each node represents either a client (leaf) or a group (internal
node) that summarizes aggregate information.



values. Thus, for an internal node, tickets is the total
ticket sum for all of the clients that it covers, and stride
= stride1 / tickets. The pass value for an internal node
is updated whenever the pass value for any of the clients
that it covers is modified.

Figure 6 presents ANSI C code for the basic hierar- 
chical stride scheduling algorithm. Each node has the 
normal tickets, stride, and pass scheduling state, as well 
as the usual tree links to its parent, left child, and right 
child. An allocation is performed by tracing a path from 
the root of the tree to a leaf, choosing the child with the 
smaller pass value at each level. Once the selected client 
has finished using the resource, its pass value is updated 
to reflect its usage. The client update is identical to 
that used in the dynamic stride algorithm that supports 
nonuniform quanta, listed in Figure 3. However, the hi- 
erarchical scheduler requires additional updates to each 
of the client's ancestors, following the leaf-to-root path 
formed by successive parent links. 

Each client allocation can be viewed as a series of
pairwise allocations among groups of clients at each
level in the tree. The maximum error for each pairwise
allocation is 1, and in the worst case, error can accumulate
at each level. Thus, the maximum absolute error
for the overall tree-based allocation is the height of the
tree, which is $\lceil \lg n_c \rceil$, where $n_c$ is the number of clients.
Since the error for a pairwise A : B ratio is minimized
when A = B, absolute error can be further reduced by
carefully choosing client leaf positions to better balance
the tree based on the number of tickets at each node.
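The root-to-leaf selection and leaf-to-root update can be traced on a small tree. The following is a sketch under stated assumptions: every allocation uses exactly one quantum, each node's pass is initialized to its stride (the paper does not show initialization for the hierarchical case), stride1 = 24 is chosen only to keep the numbers small, and `node_init` and `trace` are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    struct node *left, *right, *parent;
    int tickets, stride, pass;
};

static const int stride1 = 24;   /* small constant for a readable trace */

/* set up a node; pass starts at one full stride (an assumption here) */
static void node_init(struct node *n, struct node *parent,
                      struct node *left, struct node *right, int tickets)
{
    assert(tickets > 0);
    n->parent = parent; n->left = left; n->right = right;
    n->tickets = tickets;
    n->stride = stride1 / tickets;
    n->pass = n->stride;
}

/* one full-quantum allocation: descend by minimum pass (ties go left),
   then advance the pass of the chosen leaf and of each ancestor */
static struct node *allocate(struct node *root)
{
    struct node *n = root, *leaf;

    while (n->left != NULL || n->right != NULL) {
        if (n->left == NULL ||
            (n->right != NULL && n->right->pass < n->left->pass))
            n = n->right;
        else
            n = n->left;
    }
    leaf = n;
    for (; n != NULL; n = n->parent)
        n->pass += n->stride;
    return leaf;
}

/* build the tree (A=3, B=1) | (C=2, D=2) and record n choices as
   letters in out[], which must hold n + 1 chars */
void trace(int n, char out[])
{
    struct node root, g1, g2, leaf[4];
    int tickets[4] = {3, 1, 2, 2}, i;

    node_init(&root, NULL, &g1, &g2, 8);
    node_init(&g1, &root, &leaf[0], &leaf[1], 4);
    node_init(&g2, &root, &leaf[2], &leaf[3], 4);
    for (i = 0; i < 4; i++)
        node_init(&leaf[i], i < 2 ? &g1 : &g2, NULL, NULL, tickets[i]);
    for (i = 0; i < n; i++)
        out[i] = (char)('A' + (allocate(&root) - leaf));
    out[n] = '\0';
}
```

Over eight quanta this produces the order A C A D A C B D: the two groups alternate at the top level because they hold equal tickets, and within each group the clients receive quanta in their 3 : 1 and 2 : 2 ratios.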



/* dynamically modify node allocation by delta tickets */
void node_modify(node_t n, node_t root, int delta)
{
    int old_stride, remain;

    /* compute new tickets, stride */
    old_stride = n->stride;
    n->tickets += delta;
    n->stride = stride1 / n->tickets;

    /* done when reach root */
    if (n == root)
        return;

    /* scale remaining passes to reflect change in stride */
    remain = n->pass - root->pass;
    remain = (remain * n->stride) / old_stride;
    n->pass = root->pass + remain;

    /* propagate change to ancestors */
    node_modify(n->parent, root, delta);
}



4.2 Dynamic Modifications 

Extending the basic hierarchical stride algorithm to 
support dynamic modifications requires a careful consid- 
eration of the effects of changes at each level in the tree. 
Figure 7 lists ANSI C code for performing a ticket mod- 
ification that works for both clients and internal nodes. 
Changes to client ticket allocations essentially follow 
the same scaling and update rules used for normal stride 
scheduling, listed in Figure 5. The hierarchical sched- 
uler requires additional updates to each of the client's 
ancestors, following the leaf-to-root path formed by suc- 
cessive parent links. Note that the root pass value used 
in Figure 7 effectively takes the place of the global_pass 
variable used in Figure 5; both represent the aggregate 
global scheduler pass. 
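To make the scaling step concrete, the following self-contained sketch repeats the steps of the ticket modification routine on a tiny two-node tree. The names `mnode` and `mnode_modify` are hypothetical, and `STRIDE1` is assumed to be `1 << 20`. One deliberate difference: the intermediate product is held in a 64-bit integer, since with a stride constant this large the product of the remaining pass and the new stride may not fit in a 32-bit `int`.

```c
#include <assert.h>
#include <stddef.h>

#define STRIDE1 (1 << 20)   /* assumed stride constant */

typedef struct mnode {
    struct mnode *parent;
    int tickets, stride, pass;
} mnode;

/* change a node's allocation by delta tickets and propagate to the root */
void mnode_modify(mnode *n, mnode *root, int delta)
{
    int old_stride = n->stride;
    long long remain;

    /* compute new tickets, stride */
    assert(n->tickets + delta > 0);
    n->tickets += delta;
    n->stride = STRIDE1 / n->tickets;

    /* done when the root is reached */
    if (n == root)
        return;

    /* scale the remaining pass by new_stride / old_stride */
    remain = (long long)n->pass - root->pass;
    remain = (remain * n->stride) / old_stride;
    n->pass = root->pass + (int)remain;

    /* propagate the same ticket delta to the ancestors */
    assert(n->parent != NULL);
    mnode_modify(n->parent, root, delta);
}
```

For example, a client holding 2 tickets (stride 524288) that is half a stride (262144) ahead of the root pass, and is then given 2 more tickets, ends up half of its new stride (131072) ahead of the root pass, so its position relative to its next allocation is preserved.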



Figure 7: Dynamic Ticket Modification. ANSI C code 
for dynamic modifications to client ticket allocations un- 
der hierarchical stride scheduling. A modification requires 
$O(\lg n_c)$ time to propagate changes. 



Although not presented here, we have also devel- 
oped operations to support dynamic client participa- 
tion under hierarchical stride scheduling [Wal95]. As 
for allocate(), the time complexity for client_join() and 
client_leave() operations is $O(\lg n_c)$, where $n_c$ is the 
number of clients. 



5 Simulation Results 

This section presents the results of several quantitative 
experiments designed to evaluate the effectiveness of 
stride scheduling. We examine the behavior of stride 
scheduling in both static and dynamic environments, 
and also test hierarchical stride scheduling. When stride 
scheduling is compared to lottery scheduling, we find 
that the stride-based approach provides more accurate 
control over relative throughput rates, with much lower 
variance in response times. 

For example, Figure 8 presents the results of schedul- 
ing three clients with a 3 : 2 : 1 ticket ratio for 100 al- 
locations. The dashed lines represent the ideal alloca- 
tions for each client. It is clear from Figure 8(a) that 
lottery scheduling exhibits significant variability at this 
time scale, due to the algorithm's inherent use of ran- 
domization. In contrast, Figure 8(b) indicates that the 
deterministic stride scheduler produces precise periodic 
behavior. 
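The periodic behavior of the deterministic scheduler can be reproduced with a few lines of C. This is a minimal sketch of the basic stride algorithm under stated assumptions: the names `sclient`, `sclient_init`, `stride_allocate`, and `stride_trace` are hypothetical, `STRIDE1` is assumed to be `1 << 20`, each pass starts at one stride, every allocation uses a full quantum, and ties are broken in favor of the lowest client index.

```c
#include <assert.h>

#define STRIDE1 (1 << 20)   /* assumed stride constant */

typedef struct {
    int tickets, stride, pass;
} sclient;

/* stride is inversely proportional to tickets; pass starts at one stride */
void sclient_init(sclient *c, int tickets)
{
    assert(tickets > 0);
    c->tickets = tickets;
    c->stride = STRIDE1 / tickets;
    c->pass = c->stride;
}

/* select the client with the minimum pass and advance it by its stride */
int stride_allocate(sclient *clients, int n)
{
    int i, best = 0;
    for (i = 1; i < n; i++)
        if (clients[i].pass < clients[best].pass)
            best = i;
    clients[best].pass += clients[best].stride;
    return best;
}

/* record count winners as letters 'A', 'B', ...; out needs count + 1 bytes */
void stride_trace(sclient *clients, int n, char *out, int count)
{
    int i;
    for (i = 0; i < count; i++)
        out[i] = (char)('A' + stride_allocate(clients, n));
    out[count] = '\0';
}
```

With three clients holding 3, 2, and 1 tickets, twelve allocations produce the trace `ABAABCABAABC`, that is, the pattern A, B, A, A, B, C repeated.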



5.1 Throughput Accuracy 

Under randomized lottery scheduling, the expected 
value for the absolute error between the specified and 
actual number of allocations for any set of clients is 
$O(\sqrt{n_a})$, where $n_a$ is the number of allocations. This 
is because the number of lotteries won by a client has 
a binomial distribution. The probability $p$ that a client 
holding $t$ tickets will win a given lottery with a total of $T$ 
tickets is simply $p = t/T$. After $n_a$ identical lotteries, 
the expected number of wins $w$ is $E[w] = n_a p$, with 
variance $\sigma_w^2 = n_a p (1 - p)$. 
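As a rough worked example of these formulas, consider the client holding 7 of 10 tickets over 1000 lotteries (the numbers are chosen here for illustration, to match the 7 : 3 ratio used in the simulations):

```latex
\begin{align*}
p &= t/T = 7/10 = 0.7, \qquad n_a = 1000 \\
E[w] &= n_a p = 700 \\
\sigma_w^2 &= n_a p (1 - p) = 1000 \times 0.7 \times 0.3 = 210 \\
\sigma_w &= \sqrt{210} \approx 14.5
\end{align*}
```

Quadrupling $n_a$ to 4000 only doubles $\sigma_w$ to about 29, so the spread grows in absolute terms but shrinks relative to $n_a$.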

Under deterministic stride scheduling, the relative er- 
ror between the specified and actual number of alloca- 
tions for any pair of clients never exceeds one, indepen- 
dent of $n_a$. This is because the only source of relative 
error is due to quantization. 

Figure 8: Lottery vs. Stride Scheduling. Simulation 
results for 100 allocations involving three clients, A, B, and 
C, with a 3 : 2 : 1 allocation. The dashed lines represent ideal 
proportional-share behavior. (a) Allocation by randomized 
lottery scheduler shows significant variability. (b) Allocation 
by deterministic stride scheduler exhibits precise periodic be- 
havior: A, B, A, A, B, C. 

Figure 9: Throughput Accuracy. Simulation results for two clients with 7 : 3 (top) and 19 : 1 (bottom) ticket ratios over 1000 
allocations. Only the first 100 quanta are shown for the stride scheduler, since its quantization error is deterministic and periodic. 
(a) Mean lottery scheduler error, averaged over 1000 separate 7 : 3 runs. (b) Stride scheduler error for a single 7 : 3 run. (c) Mean 
lottery scheduler error, averaged over 1000 separate 19 : 1 runs. (d) Stride scheduler error for a single 19 : 1 run. 

Figure 9 plots the absolute error⁷ that results from 
simulating two clients under both lottery scheduling and 
stride scheduling. The data depicted is representative 
of our simulation results over a large range of pairwise 
ratios. Figure 9(a) shows the mean error averaged over 
1000 separate lottery scheduler runs with a 7 : 3 ticket 
ratio. As expected, the error increases slowly with $n_a$, 
indicating that accuracy steadily improves when error 
is measured as a percentage of $n_a$. Figure 9(b) shows 
the error for a single stride scheduler run with the same 
7 : 3 ticket ratio. As expected, the error never exceeds 
a single quantum, and follows a deterministic pattern 
with period 10. The error drops to zero at the end of 
each complete period, corresponding to a precise 7 : 3 
allocation. Figures 9(c) and 9(d) present data for similar 
experiments involving a larger 19:1 ticket ratio. 

5.2 Dynamic Ticket Allocations 

Figure 10 plots the absolute error that results from 
simulating two clients under both lottery scheduling and 
stride scheduling with rapidly-changing dynamic ticket 
allocations. This data is representative of simulation re- 
sults over a large range of pairwise ratios and a variety 
of dynamic modification techniques. For easy compar- 
ison, the average dynamic ticket ratios are identical to 
the static ticket ratios used in Figure 9. 

The notation [A,B] indicates a random ticket allo- 
cation that is uniformly distributed from A to B. New, 
randomly-generated ticket allocations were dynamically 
assigned every other quantum. The client_modify() oper- 
ation was executed for each change under stride schedul- 
ing; no special actions were necessary under lottery 
scheduling. To compute error values, specified allo- 
cations were determined incrementally. Each client's 
specified allocation was advanced by t/T on every quan- 
tum, where t is the client's current ticket allocation, and 
T is the current ticket total. 
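The incremental bookkeeping described above can be sketched as follows. The type `alloc_track` and the functions `track_quantum` and `track_error` are hypothetical names introduced for this illustration; the simulator's actual data structures are not shown in the text.

```c
#include <assert.h>

/* running totals for one client */
typedef struct {
    double specified;   /* ideal allocations accumulated so far */
    int actual;         /* quanta actually received so far */
} alloc_track;

/* account for one quantum: the client currently holds tickets out of
   total, and won is 1 if the client received this quantum, else 0 */
void track_quantum(alloc_track *a, int tickets, int total, int won)
{
    assert(total > 0 && tickets >= 0 && tickets <= total);
    a->specified += (double)tickets / (double)total;
    a->actual += won;
}

/* absolute error between specified and actual allocations */
double track_error(const alloc_track *a)
{
    double e = a->specified - (double)a->actual;
    return e < 0.0 ? -e : e;
}
```

Because the specified total advances by the current t/T on every quantum, a change in ticket allocations simply changes the increment used from that quantum onward; no history needs to be recomputed.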

Figure 10(a) shows the mean error averaged over 1000 
separate lottery scheduler runs with a [2,12] : 3 ticket ra- 
tio. Despite the dynamic changes, the mean error is 
nearly the same as that measured for the static 7 : 3 ratio 
depicted in Figure 9(a). Similarly, Figure 10(b) shows 
the error for a single stride scheduler run with the same 
dynamic [2,12] : 3 ratio. The error never exceeds a sin- 
gle quantum, although it is much more erratic than the 
periodic pattern exhibited for the static 7 : 3 ratio in Fig- 
ure 9(b). Figures 10(c) and 10(d) present data for similar 
experiments involving a larger dynamic 190 : [5,15] ra- 
tio. The results for this allocation are comparable to 
those measured for the static 19 : 1 ticket ratio depicted 
in Figures 9(c) and 9(d). 

⁷ In this case the relative and absolute errors are identical, since 
there are only two clients. 

Overall, the error measured under both lottery 
scheduling and stride scheduling is largely unaffected 
by dynamic ticket modifications. This suggests that both 
mechanisms are well-suited to dynamic environments. 
However, stride scheduling is clearly more accurate in 
both static and dynamic environments. 

5.3 Response Time Variability 

Another important performance metric is response time, 
which we measure as the elapsed time from a client's 
completion of one quantum up to and including its com- 
pletion of another. Under randomized lottery schedul- 
ing, client response times have a geometric distribution. 
The expected number of lotteries $n_a$ that a client must 
wait before its first win is $E[n_a] = 1/p$, with variance 
$\sigma_{n_a}^2 = (1 - p)/p^2$. Deterministic stride scheduling 
exhibits dramatically less response-time variability. 
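Plugging the win probabilities of the simulated clients into these formulas gives a quick sanity check on the lottery scheduler's expected behavior (a worked example added for illustration; $\sigma_{n_a}$ denotes the square root of the variance above):

```latex
\begin{align*}
p = 0.7: \quad E[n_a] &= 1/0.7 \approx 1.43, & \sigma_{n_a} &= \sqrt{0.3}/0.7 \approx 0.78 \\
p = 0.3: \quad E[n_a] &= 1/0.3 \approx 3.33, & \sigma_{n_a} &= \sqrt{0.7}/0.3 \approx 2.79 \\
p = 0.05: \quad E[n_a] &= 1/0.05 = 20, & \sigma_{n_a} &= \sqrt{0.95}/0.05 \approx 19.5
\end{align*}
```

The first two rows agree with the lottery scheduler values reported in the Figure 11 caption, and the third is close to the measured values reported for the 1-ticket client in Figure 12(c).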

Figures 11 and 12 present client response time distri- 
butions under both lottery scheduling and stride schedul- 
ing. Figure 11 shows the response times that result from 
simulating two clients with a 7 : 3 ticket ratio for one mil- 
lion allocations. The stride scheduler distributions are 
very tight, while the lottery scheduler distributions are 
geometric with long tails. For example, the client with 
the smaller allocation had a maximum response time of 
4 quanta under stride scheduling, while the maximum 
response time under lottery scheduling was 39. 

Figure 12 presents similar data for a larger 19 : 1 ticket 
ratio. Although there is little difference in the response 
time distributions for the client with the larger allocation, 
the difference is enormous for the client with the smaller 
allocation. Under stride scheduling, virtually all of the 
response times were exactly 20 quanta. The lottery 
scheduler produced geometrically-distributed response 
times ranging from 1 to 194 quanta. In this case, the 
standard deviation of the stride scheduler's distribution 
is three orders of magnitude smaller than the standard 
deviation of the lottery scheduler's distribution. 

Figure 10: Throughput Accuracy - Dynamic Allocations. Simulation results for two clients with [2,12] : 3 (top) and 
190 : [5,15] (bottom) ticket ratios over 1000 allocations. The notation [A,B] indicates a random ticket allocation that is uniformly 
distributed from A to B. Random ticket allocations were dynamically updated every other quantum. (a) Mean lottery scheduler 
error, averaged over 1000 separate [2,12] : 3 runs. (b) Stride scheduler error for a single [2,12] : 3 run. (c) Mean lottery scheduler 
error, averaged over 1000 separate 190 : [5,15] runs. (d) Stride scheduler error for a single 190 : [5,15] run. 

Figure 11: Response Time Distribution. Simulation results for two clients with a 7 : 3 ticket ratio over one million 
allocations. (a) Client with 7 tickets under lottery scheduling: $\mu = 1.43$, $\sigma = 0.78$. (b) Client with 7 tickets under stride 
scheduling: $\mu = 1.43$, $\sigma = 0.49$. (c) Client with 3 tickets under lottery scheduling: $\mu = 3.33$, $\sigma = 2.79$. (d) Client with 3 
tickets under stride scheduling: $\mu = 3.33$, $\sigma = 0.47$. 

Figure 12: Response Time Distribution. Simulation results for two clients with a 19 : 1 ticket ratio over one million 
allocations. (a) Client with 19 tickets under lottery scheduling: $\mu = 1.05$, $\sigma = 0.24$. (b) Client with 19 tickets under stride 
scheduling: $\mu = 1.05$, $\sigma = 0.22$. (c) Client with 1 ticket under lottery scheduling: $\mu = 20.13$, $\sigma = 19.64$. (d) Client with 1 
ticket under stride scheduling: $\mu = 20.00$, $\sigma = 0.01$. 

5.4 Hierarchical Stride Scheduling 



As discussed in Section 4, stride scheduling can produce 
an absolute error of $O(n_c)$ for skewed ticket distribu- 
tions, where $n_c$ is the number of clients. In contrast, 
hierarchical stride scheduling bounds the absolute er- 
ror to $O(\lg n_c)$. As a result, response-time variability 
can be significantly reduced under hierarchical stride 
scheduling. 

Figure 13 presents client response time distributions 
under both hierarchical stride scheduling and ordinary 
stride scheduling. Eight clients with a 7 : 1 : ... : 1 ticket 
ratio were simulated for one million allocations. Ex- 
cluding the very first allocation, the response time for 
each of the low-throughput clients was always 14, under 
both schedulers. Thus we only present response time 
distributions for the high-throughput client. 

The ordinary stride scheduler runs the high- 
throughput client for 7 consecutive quanta, and then 
runs each of the low-throughput clients for one quan- 
tum. The hierarchical stride scheduler interleaves the 
clients, resulting in a tighter distribution. In this case, 
the standard deviation of the ordinary stride scheduler's 
distribution is more than twice as large as that for the 
hierarchical stride scheduler. We observed a maximum 
absolute error of 4 quanta for the high-throughput client 
under ordinary stride scheduling, and only 1.5 quanta 
under hierarchical stride scheduling. 
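The bursty behavior of the ordinary stride scheduler on this skewed distribution is easy to reproduce. The sketch below is an illustration under stated assumptions rather than the simulator used for Figure 13: the function name `max_response_time_client0` and the limit `NCLIENTS` are hypothetical, `STRIDE1` is assumed to be `1 << 20`, passes start at one stride, and ties go to the lowest client index.

```c
#include <assert.h>

#define STRIDE1 (1 << 20)   /* assumed stride constant */
#define NCLIENTS 8          /* maximum clients handled by this sketch */

/* Run an ordinary (flat) stride scheduler for the given number of
   allocations and return the longest response time, in quanta, seen by
   client 0 between two of its successive allocations. */
int max_response_time_client0(const int *tickets, int n, int allocations)
{
    int stride[NCLIENTS], pass[NCLIENTS];
    int i, q, last = -1, worst = 0;

    assert(n > 0 && n <= NCLIENTS);
    for (i = 0; i < n; i++) {
        assert(tickets[i] > 0);
        stride[i] = STRIDE1 / tickets[i];
        pass[i] = stride[i];
    }
    for (q = 0; q < allocations; q++) {
        int best = 0;
        for (i = 1; i < n; i++)
            if (pass[i] < pass[best])
                best = i;
        pass[best] += stride[best];
        if (best == 0) {
            if (last >= 0 && q - last > worst)
                worst = q - last;
            last = q;
        }
    }
    return worst;
}
```

For the 7 : 1 : ... : 1 allocation over two full periods (28 quanta), the 7-ticket client runs 7 times in a row and then waits for all seven 1-ticket clients, so its longest response time is 8 quanta. Six response times of 1 and one of 8 per period give a mean of 2, consistent with the value reported for the ordinary stride scheduler in Figure 13.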

6 Prototype Implementations 

We implemented two prototype stride schedulers by 
modifying the Linux 1.1.50 kernel on a 25MHz i486- 
based IBM Thinkpad 350C. The first prototype enables 
proportional-share control over processor time, and the 
second enables proportional-share control over network 
transmission bandwidth. 

6.1 Process Scheduler 

The goal of our first prototype was to permit 
proportional-share allocation of processor time to con- 
trol relative computation rates. We primarily changed 
the kernel code that handles process scheduling, switch- 
ing from a conventional priority scheduler to a stride- 
based algorithm with a scheduling quantum of 100 mil- 
liseconds. Ticket allocations can be specified via a new 
stride_cpu_set_tickets() system call. We did not 
implement support for higher-level abstractions such as 
ticket transfers and currencies. Fewer than 300 lines of 
source code were added or modified to implement our 
changes. 

Figure 13: Hierarchical Stride Scheduling. Response 
time distributions for a simulation of eight clients with a 
7 : 1 : ... : 1 ticket ratio over one million allocations. Re- 
sponse times are shown only for the client with 7 tickets. (a) 
Hierarchical Stride Scheduler: $\mu = 2.00$, $\sigma = 1.07$. (b) 
Ordinary Stride Scheduler: $\mu = 2.00$, $\sigma = 2.45$. 

Figure 14: CPU Rate Accuracy. For each allocation 
ratio, the observed iteration ratio is plotted for each of three 
30 second runs. The gray line indicates the ideal where the 
two ratios are identical. The observed ratios are within 1% of 
the ideal for all data points. 

Figure 15: CPU Fairness Over Time. Two processes 
executing the compute-bound arith benchmark with a 3 : 1 
ticket allocation. Averaged over the entire run, the two pro- 
cesses executed 2409.18 and 802.89 iterations/sec, for an 
actual ratio of 3.001 : 1. 

Our first experiment tested the accuracy with which 
our prototype could control the relative execution rate 
of computations. Each point plotted in Figure 14 indi- 
cates the relative execution rate that was observed for 
two processes running the compute-bound arith inte- 
ger arithmetic benchmark [Byt91]. Three thirty-second 
runs were executed for each integral ratio between one 
and ten. In all cases, the observed ratios are within 1% 
of the ideal. We also ran experiments involving higher 
ratios, and found that the observed ratio for a 20 : 1 al- 
location ranged from 19.94 to 20.04, and the observed 
ratio for a 50 : 1 allocation ranged from 49.93 to 50.44. 

Our next experiment examined the scheduler's behav- 
ior over shorter time intervals. Figure 15 plots average 
iteration counts over a series of 2-second time windows 
during a single 60 second execution with a 3 : 1 alloca- 
tion. The two processes remain close to their allocated 
ratios throughout the experiment. Note that if we used a 
10 millisecond time quantum instead of the scheduler's 
100 millisecond quantum, the same degree of fairness 
would be observed over a series of 200 millisecond time 
windows. 

To assess the overhead imposed by our prototype 
stride scheduler, we ran performance tests consisting 
of concurrent arith benchmark processes. Overall, we 
found that the performance of our prototype was com- 
parable to that of the standard Linux process scheduler. 
Compared to unmodified Linux, groups of 1, 2, 4, and 
8 arith processes each completed fewer iterations un- 
der stride scheduling, but the difference was always less 
than 0.2%. 

However, neither the standard Linux scheduler nor 
our prototype stride scheduler is particularly efficient. 
For example, the Linux scheduler performs a linear scan 
of all processes to find the one with the highest priority. 
Our prototype also performs a linear scan to find the 
process with the minimum pass; an $O(\lg n_c)$ time im- 
plementation would have required substantial changes 
to existing kernel code. 

6.2 Network Device Scheduler 

The goal of our second prototype was to permit 
proportional-share control over transmission bandwidth 
for network devices such as Ethernet and SLIP inter- 
faces. Such control would be particularly useful for 
applications such as concurrent ftp file transfers, and 
concurrent http Web server replies. For example, many 
Web servers have relatively slow connections to the In- 
ternet, resulting in substantial delays for transfers of 
large objects such as graphical images. Given control 
over relative transmission rates, a Web server could pro- 
vide different levels of service to concurrent clients. For 
example, tickets⁸ could be issued by servers based upon 
the requesting user, machine, or domain. Commercial 
servers could even sell tickets to clients demanding faster 
service. 

We primarily changed the kernel code that han- 
dles generic network device queueing. This involved 
switching from conventional FIFO queueing to stride- 
based queueing that respects per-socket ticket alloca- 
tions. Ticket allocations can be specified via a new 
SO_TICKETS option to the setsockopt() system call. 
Although not implemented in our prototype, a more 
complete system should also consider additional forms 
of admission control to manage other system resources, 
such as network buffers. Fewer than 300 lines of source 
code were added or modified to implement our changes. 

Our first experiment tested the prototype's ability to 
control relative network transmission rates on a local 
area network. We used the ttcp network test program⁹ 
[TTC91] to transfer fabricated buffers from an IBM 
Thinkpad 350C running our modified Linux kernel, to a 
DECStation 5000/133 running Ultrix. Both machines 
were on the same physical subnet, connected via a 
10Mbps Ethernet that also carried network traffic for 
other users. 

⁸ To be included with http requests, tickets would require an 
external data representation. If security is a concern, cryptographic 
techniques could be employed to prevent forgery and theft. 

⁹ We made a few minor modifications to the standard ttcp bench- 
mark. Other than extensions to specify ticket allocations and facili- 
tate coordinated timing, we also decreased the value of a hard-coded 
delay constant. This constant is used to temporarily put a trans- 
mitting process to sleep when it is unable to write to a socket due 
to a lack of buffer space (ENOBUFS). Without this modification, the 
observed throughput ratios were consistently lower than specified 
allocations, with significant differences for large ratios. With the 
larger delay constant, we believe that the low-throughput client is 
able to continue sending packets while the high-throughput client is 
sleeping, distorting the intended throughput ratio. Of course, chang- 
ing the kernel interface to signal a process when more buffer space 
becomes available would probably be preferable to polling. 

Figure 16: Ethernet UDP Rate Accuracy. For each 
allocation ratio, the observed data transmission ratio is plotted 
for each of three runs. The gray line indicates the ideal where 
the two ratios are identical. The observed ratios are within 
5% of the ideal for all data points. 

Each point plotted in Figure 16 indicates the rela- 
tive UDP data transmission rate that was observed for 
two processes running the ttcp benchmark. Each ex- 
periment started with both processes on the sending ma- 
chine attempting to transmit 4K buffers, each containing 
8Kbytes of data, for a total 32Mbyte transfer. As soon 
as one process finished sending its data, it terminated the 
other process via a Unix signal. Metrics were recorded 
on the receiving machine to capture end-to-end applica- 
tion throughput. The observed ratios are very accurate; 
all data points are within 5% of the ideal. For larger 
ticket ratios, the observed throughput ratio is slightly 
lower than the specified allocation. For example, a 20 : 1 
allocation resulted in actual throughput ratios ranging 
from 18.51:1 to 18.77:1. 

To assess the overhead imposed by our prototype, 
we ran performance tests consisting of concurrent ttcp 
benchmark processes. Overall, we found that the perfor- 
mance of our prototype was comparable to that of stan- 
dard Linux. Although the prototype increases the length 
of the critical path for sending a network packet, we were 
unable to observe any significant difference between un- 
modified Linux and stride scheduling. We believe that 
the small additional overhead of stride scheduling was 
masked by the variability of external network traffic from 
other users; individual differences were in the range of 
±5%. 



7 Related Work 

We independently developed stride scheduling as a de- 
terministic alternative to the randomized selection as- 
pect of lottery scheduling [Wal94]. We then discov- 
ered that the core allocation algorithm used in stride 
scheduling is nearly identical to elements of rate-based 
flow-control algorithms designed for packet-switched 
networks [Dem90, Zha91, ZhK91, Par93]. Despite the 
relevance of this networking research, to the best of our 
knowledge it has not been discussed in the processor 
scheduling literature. In this section we discuss a va- 
riety of related scheduling work, including rate-based 
network flow control, deterministic proportional-share 
schedulers, priority schedulers, real-time schedulers, 
and microeconomic schedulers. 

7.1 Rate-Based Network Flow Control 

Our basic stride scheduling algorithm is very similar 
to Zhang's VirtualClock algorithm for packet-switched 
networks [Zha91]. In this scheme, a network switch 
orders packets to be forwarded through outgoing links. 
Every packet belongs to a client data stream, and each 
stream has an associated bandwidth reservation. A vir- 
tual clock is assigned to each stream, and each of its 
packets is stamped with its current virtual time upon ar- 
rival. With each arrival, the virtual clock advances by a 
virtual tick that is inversely proportional to the stream's 
reserved data rate. Using our stride-oriented terminol- 
ogy, a virtual tick is analogous to a stride, and a virtual 
clock is analogous to a pass value. 

The VirtualClock algorithm is closely related to the 
weighted fair queueing (WFQ) algorithm developed by 
Demers, Keshav, and Shenker [Dem90], and Parekh and 
Gallager's equivalent packet-by-packet generalized pro- 
cessor sharing (PGPS) algorithm [Par93]. One differ- 
ence that distinguishes WFQ and PGPS from Virtual- 
Clock is that they effectively maintain a global virtual 
clock. Arriving packets are stamped with their stream's 
virtual tick plus the maximum of their stream's virtual 
clock and the global virtual clock. Without this modi- 
fication, an inactive stream can later monopolize a link 
while its virtual clock catches up to those of active streams; 
such behavior is possible under the VirtualClock algo- 
rithm [Par93]. 
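The difference between the two stamping rules, as they are described above, can be sketched in a few lines. The type `vstream` and the functions `stamp_virtualclock` and `stamp_wfq` are hypothetical names for this illustration, and the code follows only the simplified description in the text, not the full published algorithms.

```c
#include <assert.h>

typedef struct {
    double vclock;   /* the stream's virtual clock */
    double vtick;    /* virtual tick, inversely proportional to reserved rate */
} vstream;

/* per-stream rule: advance the stream's own clock by one virtual tick */
double stamp_virtualclock(vstream *s)
{
    s->vclock += s->vtick;
    return s->vclock;
}

/* global-clock rule: start from the later of the stream's clock and the
   global virtual clock, then add one virtual tick */
double stamp_wfq(vstream *s, double global_vclock)
{
    double base = s->vclock > global_vclock ? s->vclock : global_vclock;
    s->vclock = base + s->vtick;
    return s->vclock;
}
```

Consider a stream that has been idle while the global virtual clock advanced to 100. Under the per-stream rule its next packet is stamped 1, far behind every active stream, so it is served ahead of them until its clock catches up. Under the global-clock rule the same packet is stamped 101, and the stream simply rejoins at the current virtual time.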

Our stride scheduler's use of a global_pass vari- 
able is based on the global virtual clock employed by 
WFQ/PGPS, which follows an update rule that produces 
a smoothly varying global virtual time. Before we be- 
came aware of the WFQ/PGPS work, we used a simpler 
global_pass update rule: global_pass was set to the pass 
value of the client that currently owns the resource. To 
see the difference between these approaches, consider 
the set of minimum pass values over time in Figure 2. 
Although the average pass value increase per quantum 
is 1, the actual increases occur in non-uniform steps. 
We adopted the smoother WFQ/PGPS virtual time rule 
to improve the accuracy of pass updates associated with 
dynamic modifications. 

To the best of our knowledge, our work on stride 
scheduling is the first cross-application of rate-based 
network flow control algorithms to scheduling other re- 
sources such as processor time. New techniques were 
required to support dynamic changes and higher-level 
abstractions such as ticket transfers and currencies. Our 
hierarchical stride scheduling algorithm is a novel recur- 
sive application of the basic technique that exhibits im- 
proved throughput accuracy and reduced response time 
variability compared to prior schemes. 

7.2 Proportional-Share Schedulers 

Several other deterministic approaches have recently 
been proposed for proportional-share processor schedul- 
ing [Fon95, Mah95, Sto95]. However, all require expen- 
sive operations to transform client state in response to 
dynamic changes. This makes them less attractive than 
stride scheduling for supporting dynamic or distributed 
environments. Moreover, although each algorithm is ex- 
plicitly compared to lottery scheduling, none provides 
efficient support for the flexible resource management 
abstractions introduced with lottery scheduling. 

Stoica and Abdel-Wahab have devised an interesting 
scheduler using a deterministic generator that employs 
a bit-reversed counter in place of the random number 
generator used by lottery scheduling [Sto95]. Their al- 
gorithm results in an absolute error for throughput that 
is $O(\lg n_a)$, where $n_a$ is the number of allocations. Al- 
locations can be performed efficiently in $O(\lg n_c)$ time 
using a tree-based data structure, where $n_c$ is the number 
of clients. However, dynamic modifications to the set 
of active clients or their allocations require executing a 
relatively complex "restart" operation with $O(n_c)$ time 
complexity. Also, no support is provided for fractional 
or nonuniform quanta. 

Maheshwari has developed a deterministic charge- 
based proportional-share scheduler [Mah95]. Loosely 
based on an analogy to digitized line drawing, this 
scheme has a maximum relative throughput error of 
one quantum, and also supports fractional quanta. Al- 
though efficient in many cases, allocation has a worst- 
case $O(n_c)$ time complexity, where $n_c$ is the number 
of clients. Dynamic modifications require executing a 
"refund" operation with $O(n_c)$ time complexity. 

Fong and Squillante have introduced a general 
scheduling approach called time-function scheduling 
(TFS) [Fon95]. TFS is intended to provide differen- 
tial treatment of job classes, where specific throughput 
ratios are specified across classes, while jobs within each 
class are scheduled in a FCFS manner. Time functions 
are used to compute dynamic job priorities as a func- 
tion of the time each job has spent waiting since it was 
placed on the run queue. Linear functions result in pro- 
portional sharing: a job's value is equal to its waiting 
time multiplied by its job-class slope, plus a job-class 
constant. An allocation is performed by selecting the 
job with the maximum time-function value. A naive 
implementation would be very expensive, but since jobs 
are grouped into classes, allocation can be performed in 
$O(n)$ time, where $n$ is the number of distinct classes. If 
time-function values are updated infrequently compared 
to the scheduling quantum, then a priority queue can be 
used to reduce the allocation cost to $O(\lg n)$, with an 
$O(n \lg n)$ cost to rebuild the queue after each update. 
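The linear time-function case described above can be sketched as a per-class scan. This is only an illustration of the selection rule as summarized here, not the TFS implementation from [Fon95]; the type `tfs_class` and the functions `tfs_value` and `tfs_select` are hypothetical, and each class is assumed to expose the waiting time of the job at the head of its FCFS queue.

```c
#include <assert.h>

typedef struct {
    double slope;      /* job-class slope */
    double constant;   /* job-class constant */
    double wait;       /* waiting time of the class's head-of-queue job */
} tfs_class;

/* linear time function: waiting time times slope, plus constant */
double tfs_value(const tfs_class *c)
{
    return c->wait * c->slope + c->constant;
}

/* O(n) scan over the n classes; returns the index of the class whose
   head job has the maximum value (the first class wins ties) */
int tfs_select(const tfs_class *classes, int n)
{
    int i, best = 0;
    assert(n > 0);
    for (i = 1; i < n; i++)
        if (tfs_value(&classes[i]) > tfs_value(&classes[best]))
            best = i;
    return best;
}
```

Because jobs within a class are served in FCFS order, only the longest-waiting job of each class needs to be examined, which is why the scan is over classes rather than over individual jobs.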

When Fong and Squillante compared TFS to lottery 
scheduling, they found that although throughput accu- 
racy was comparable, the waiting time variance of low- 
throughput tasks was often several orders of magnitude 
larger under lottery scheduling. This observation is con- 
sistent with our simulation results involving response 
time, presented in Section 5. TFS also offers the poten- 
tial to specify performance goals that are more general 
than proportional sharing. However, when proportional 
sharing is the goal, stride scheduling has advantages in 
terms of efficiency and accuracy. 

7.3 Priority Schedulers 

Conventional operating systems typically employ prior- 
ity schemes for scheduling processes [Dei90, Tan92]. 
Priority schedulers are not designed to provide 
proportional-share control over relative computation 
rates, and are often ad-hoc. Even popular priority-based 
approaches such as decay-usage scheduling are poorly 
understood, despite the fact that they are employed by 
numerous operating systems, including Unix [Hel93]. 

Fair share schedulers allocate resources so that users 
get fair machine shares over long periods of time 
[Hen84, Kay88, Hel93]. These schedulers are layered 
on top of conventional priority schedulers, and dynam- 
ically adjust priorities to push actual usage closer to 
entitled shares. The algorithms used by these systems 
are generally complex, requiring periodic usage moni- 
toring, complicated dynamic priority adjustments, and 
administrative parameter setting to ensure fairness on a 
time scale of minutes. 

7.4 Real-Time Schedulers 

Real-time schedulers are designed for time-critical sys- 
tems [Bur91]. In these systems, which include many 
aerospace and military applications, timing require- 
ments impose absolute deadlines that must be met to 
ensure correctness and safety; a missed deadline may 
have dire consequences. One of the most widely 
used techniques in real-time systems is rate-monotonic 
scheduling, in which priorities are statically assigned 
as a monotonic function of the rate of periodic tasks 
[Liu73, Sha91]. The importance of a task is not re- 
flected in its priority; tasks with shorter periods are sim- 
ply assigned higher priorities. Bounds on total processor 
utilization (ranging from 69% to nearly 100%, depend- 
ing on various assumptions) ensure that rate monotonic 
scheduling will meet all task deadlines. Another pop- 
ular technique is earliest deadline scheduling, which 
always schedules the task with the closest deadline first. 
The earliest deadline approach permits high processor 
utilization, but has increased runtime overhead due to 
the use of dynamic priorities; the task with the nearest 
deadline varies over time. 
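The lower end of the utilization range quoted above is consistent with the classic least upper bound for rate-monotonic scheduling of $n$ independent periodic tasks whose deadlines equal their periods. Reading the 69% figure this way is an assumption made here, since the text does not say which bound it has in mind:

```latex
\begin{align*}
U(n) &= n \left( 2^{1/n} - 1 \right) \\
U(1) &= 1, \qquad U(2) \approx 0.828, \qquad U(3) \approx 0.780 \\
\lim_{n \to \infty} U(n) &= \ln 2 \approx 0.693
\end{align*}
```

A task set whose total utilization stays at or below $U(n)$ is guaranteed to meet all deadlines under rate-monotonic priorities; higher utilizations may still be schedulable under additional assumptions about the task periods.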

In general, real-time schedulers depend upon very 
restrictive assumptions, including precise static knowl- 
edge of task execution times and prohibitions on task 
interactions. In addition, limitations are placed on pro- 
cessor utilization, and even transient overloads are disal- 
lowed. In contrast, the proportional-share model used by 
stride scheduling and lottery scheduling is designed for 
more general-purpose environments. Task allocations 
degrade gracefully in overload situations, and active 
tasks proportionally benefit from extra resources when 
some allocations are not fully utilized. These proper- 
ties facilitate adaptive applications that can respond to 
changes in resource availability. 

Mercer, Savage, and Tokuda recently introduced 
a higher-level processor capacity reserve abstraction 
[Mer94] for measuring and controlling processor usage 
in a microkernel system with an underlying real-time 
scheduler. Reserves can be passed across protection 
boundaries during interprocess communication, with an 
effect similar to our use of ticket transfers. While this ap- 
proach works well for many multimedia applications, its 
reliance on resource reservations and admission control 
is still more restrictive than the general-purpose model 
that we advocate. 

7.5 Microeconomic Schedulers 

Microeconomic schedulers are based on metaphors to 
resource allocation in real economic systems. Money 
encapsulates resource rights, and a price mechanism 
is used to allocate resources. Several microeconomic 
schedulers [Dre88, Mil88, Fer88, Fer89, Wal89, Wal92, 
Wel93] use auctions to determine prices and allocate re- 
sources among clients that bid monetary funds. Both the 
escalator algorithm proposed for uniprocessor schedul- 
ing [Dre88] and the distributed Spawn system [Wal92] 
rely upon auctions in which bidders increase their bids 
linearly over time. Since auction dynamics can be unex- 
pectedly volatile, auction-based approaches sometimes 
fail to achieve resource allocations that are proportional 
to client funding. The overhead of bidding also limits 
the applicability of auctions to relatively coarse-grained 
tasks. Other market-based approaches that do not rely 
upon auctions have also been applied to managing 
processor and memory resources [Ell75, Har92, Che93]. 

Stride scheduling and lottery scheduling are compat- 
ible with a market-based resource management philoso- 
phy. Our mechanisms for proportional sharing provide a 
convenient substrate for pricing individual time-shared 
resources in a computational economy. For example, 
tickets are analogous to monetary income streams, and 
the number of tickets competing for a resource can be 
viewed as its price. Our currency abstraction for flexi- 
ble resource management is also loosely borrowed from 
economics. 

8 Conclusions 

We have presented stride scheduling, a determinis- 
tic technique that provides accurate control over rel- 
ative computation rates. Stride scheduling also effi- 
ciently supports the same flexible, modular resource 
management abstractions introduced by lottery schedul- 
ing. Compared to lottery scheduling, stride scheduling 
achieves significantly improved accuracy over relative 
throughput rates, with significantly less response time 
variability. However, lottery scheduling is conceptu- 
ally simpler than stride scheduling. For example, stride 
scheduling requires careful state updates for dynamic 
changes, while lottery scheduling is effectively stateless. 

The core allocation mechanism used by stride 
scheduling is based on rate-based flow-control algorithms 
for networks. One contribution of this paper is 
a cross-application of these algorithms to the domain of 
processor scheduling. New techniques were developed 
to support dynamic modifications to client allocations 
and resource right transfers between clients. We also in- 
troduced a new hierarchical stride scheduling algorithm 
that exhibits improved throughput accuracy and lower 
response time variability compared to prior schemes. 

A Fixed-Point Stride Representation 

The precision of relative rates that can be achieved 
depends on both the value of $stride_1$ and the relative ratios 
of client ticket allocations. For example, with 
$stride_1 = 2^{20}$ and a maximum ticket allocation of $2^{10}$ tickets, 
ratios are represented with 10 bits of precision. Thus, 
ratios close to unity resulting from allocations that differ 
by only one part per thousand, such as 1001 : 1000, can 
be supported. 
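
The following sketch illustrates this precision limit. It
assumes the relation stride = stride_1 / tickets, evaluated
with integer division, and 64-bit unsigned arithmetic; the
names STRIDE1 and stride_for are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed constant: stride_1 = 2^20, as in the example above. */
#define STRIDE1 (UINT64_C(1) << 20)

/* Fixed-point stride for a client holding `tickets` tickets.
 * Integer division truncates, which is the source of the
 * precision limit discussed in the text. */
static uint64_t stride_for(uint64_t tickets)
{
    assert(tickets > 0);
    return STRIDE1 / tickets;
}
```

Under these assumptions, allocations of 1000 and 1001
tickets map to strides of 1048 and 1047, so the 1001 : 1000
ratio remains distinguishable. Beyond the $2^{10}$ ticket
range the precision is lost: allocations of 2000 and 2001
tickets both truncate to a stride of 524.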

Since $stride_1$ is a large integer, stride values will also 
be large for clients with small allocations. Since pass 
values are monotonically increasing, they will eventually 
overflow the machine word size after a large number of 
allocations. For a machine with 64-bit integers, this is 
not a practical problem. For example, with $stride_1 = 2^{20}$ 
and a worst-case client with $tickets = 1$, approximately $2^{44}$ 
allocations can be performed before overflow occurs. 
At one allocation per millisecond, centuries of real time 
would elapse before an overflow. 
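
The arithmetic behind this estimate can be checked with a
short sketch. It assumes an unsigned pass counter that
starts at zero and advances by a fixed stride on every
allocation; both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of allocations a single client can receive before a
 * pass counter bounded by max_pass would overflow, assuming
 * the pass starts at zero and grows by `stride` each time. */
static uint64_t allocations_before_overflow(uint64_t max_pass,
                                            uint64_t stride)
{
    assert(stride > 0);
    return max_pass / stride;
}

/* Converts an allocation count into years of real time at a
 * given allocation rate (365.25-day years). */
static double years_at_rate(uint64_t allocations,
                            double allocs_per_second)
{
    const double seconds_per_year = 365.25 * 24.0 * 3600.0;
    return (double)allocations / allocs_per_second / seconds_per_year;
}
```

For a 64-bit counter and a stride of $2^{20}$, the first
function returns $2^{44} - 1$. At 1000 allocations per
second, $2^{44}$ allocations correspond to roughly 557
years.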

For a machine with 32-bit integers, the pass values 
associated with all clients can be adjusted by subtract- 
ing the minimum pass value from all clients whenever 
an overflow is detected. Alternatively, such adjustments 
can periodically be made after a fixed number of allocations. 
For example, with $stride_1 = 2^{20}$, a conservative 
adjustment period would be a few thousand allocations. 
Perhaps the most straightforward approach is to simply 
use a 64-bit integer type if one is available. Our pro- 
totype implementation makes use of the 64-bit "long 
long" integer type provided by the GNU C compiler. 
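
The 32-bit adjustment described above can be sketched as
follows. This is an illustration under stated assumptions
and not our prototype code: the client record, its field
names, and the function rebase_passes are hypothetical, and
pass values are taken to be unsigned 32-bit integers. With
a stride of $2^{20}$, such a counter can absorb at most
$2^{32} / 2^{20} = 4096$ allocations to a single client
between adjustments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical minimal client record with 32-bit state. */
struct client {
    uint32_t tickets;
    uint32_t stride;
    uint32_t pass;
};

/* Subtracts the minimum pass value from every client. All
 * pairwise pass differences are preserved, so the relative
 * order in which clients are selected does not change. */
static void rebase_passes(struct client *clients, size_t n)
{
    assert(clients != NULL || n == 0);
    if (n == 0)
        return;

    uint32_t min_pass = clients[0].pass;
    for (size_t i = 1; i < n; i++)
        if (clients[i].pass < min_pass)
            min_pass = clients[i].pass;

    for (size_t i = 0; i < n; i++)
        clients[i].pass -= min_pass;
}
```

A scheduler could call such a routine either when an
impending overflow is detected or after a fixed number of
allocations. Any other state expressed in pass units, such
as a global pass value if one is maintained, would need the
same offset subtracted.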